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Mutual information is a nonlinear measure used in time series analysis in order to measure the 
linear and non- linear correlations at any lag r. The aim of this study is to evaluate some of the most 
commonly used mutual information estimators, i.e. estimators based on histograms (with fixed or 
adaptive bin size), fc- nearest neighbors and kernels. We assess the accuracy of the estimators by 
Monte-Carlo simulations on time series from nonlinear dynamical systems of varying complexity. 
As the true mutual information is generally unknown, we investigate the existence and rate of 
consistency of the estimators (convergence to a stable value with the increase of time series length), 
and the degree of deviation among the estimators. The results show that the fc-nearest neighbor 
estimator is the most stable and less affected by the method-specific parameter. 

PACS numbers: Valid PACS appear here 



I. INTRODUCTION 

Mutual information is a popular nonlinear measure of 
time series analysis, best known as a criterion to select 
the appropriate delay for state space reconstruction 
It is also used to discriminate different regimes of non- 
linear systems [2|, and to detect phase synchronization 
0,1^. Besides nonlinear dynamics, it is used in various 
statistical settings, mainly as a distance or correlation 
measure in data mining, e.g. in independent component 
analysis and feature-based clustering 0] ■ 

It is well-known that any estimate of mutual informa- 
tion, either between two variables or as a function of 
delay for time series, is (almost always) positively biased 
|1)S[I3|- For numerical- valued variables and time series, 
the mutual information increases with finer partition de- 
pending on the underlying distribution or process and the 
sample size. Beyond the classical histogram-based par- 
titioning, other schemes have been used to estimate the 
densities inherent in the measure of mutual information, 
e.g. using kernels and fc-nearest neighbors pTl. [T3|. 

Although there are some works co mparing mutual in- 
formation estimators in 0, [ll [ll [11, [ll [M [l3, [S [11 , 
to the best of our knowledge, there has not been a com- 
parison of all commonly used estimators, including the 
selection of their parameters, on time series from dynam- 
ical deterministic systems. 

Moon et al. developed a kernel mutual information 
estimator, as an extension of Silverman's work [20j. This 
estimator is compared to the locally adaptive histogram- 
based estimator of Fraser and Swinney [2l| on four linear 
and nonlinear systems using as a performance criterion 
the lag of the first minimum of mutual information. Dar- 
bellay and Vajda [l3| suggested an adaptive histogram- 
based estimator and compared it with mutual informa- 
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tion estimators derived from maximum likelihood esti- 
mators for some bivariate distributions with analytically 
known mutual information. Steuer et al. [l^ presented 
three histogram-based estimators, investigated their bias 
and suggested using the kernel estimator. Daub et al. 

estimated mutual information using B-spline func- 
tions and cornpared it to an entropy estimator suggested 
by Paninski [lO[ and a kernel density estimator on data 
sets drawn from a known distribution. They claimed 
that their method is computationally faster than the ker- 
nel estimator and improves the simple binning method. 
Kraskov et al. [ll developed an estimator of mutual in- 
formation based on fc-nearest neighbors and compared it 
to the adaptive histogram-based estimator of Darbellay 
and Vajda but only on Gaussian and some non-Gaussian 
distributions with analytically known mutual informa- 
tion. In [ll, equidistant and equiprobable histogram- 
based estimators (using three selection criteria for the 
number of bins b) are compared to the algorithm of Fraser 
and Swinney on nonlinear systems as to their robustness 
in detecting a fixed delay for the first minimum of the 
mutual information (similarly to They also use the 

bias as a performance criterion in the case of Gaussian 
processes (where the true mutual information is known) 
and find that the equiprobable histogram-based estima- 
tor is more accurate and the Fraser and Swinney estima- 
tor is computationally ineffective. 

In a different setting, in [l3| the estimators of equidis- 
tant histograms, kernels, B-splines, and fc-nearest neigh- 
bors, are tested on electroencephalographic data from 
rats in order to find dependencies between left and right 
channels. Using the surrogate data test for the signifi- 
cance of dependence and bootstrap confidence intervals 
for the estimators, they concluded that the B-spline es- 
timator is largely affected by its parameter and the fc- 
nearest neighbor is the most consistent and less depen- 
dent on its parameter. Trappenberg et al. com- 
pared the equidistant histogram-based method, the adap- 
tive histogram-based method of Darbellay and Vajda and 
the Gram- Char lier polynomial expansion [23| and con- 
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eluded that all three estimators gave reasonable estimates 
of the theoretical mutual information, but the adaptive 
histogram-based method converged faster with the sam- 
ple size. A more comprehensive evaluation of mutual 
information estimators including the kernel, /c-nearest 
neighbor, equiprobable histogram-based estimators and 
an estimator using the Edgeworth approximation to es- 
timate densities, was recently presented in [19J, focusing 
on the deviation of the mutual information from a linear 
correlation measure on linear and nonlinear time series 
(using the Henon map for the chaotic case) . In the same 
paper, a small scale simulation showed dependence of the 
performance of the kernel and nearest neighbor estima- 
tors on their parameter. 

Given the varying results in the literature on the dif- 
ferent mutual information estimators, we study here the 
performance of three commonly used estimators, i.e. es- 
timators based on histograms (with fixed or adaptive bin 
size), fc- nearest neighbors and kernels, on time series from 
nonlinear deterministic systems. Moreover, we investi- 
gate the optimal parameter for the determination of the 
two-dimensional partitioning for each method. Monte- 
Carlo simulations of dynamical systems of varying com- 
plexity and observational noise level are used in order to 
assess the accuracy of the estimators. Commonly used 
parameter selection methods are considered for all but 
the kernel estimators, where a range of bandwidths are 
tested. 

The structure of the paper is as follows. In Section HIl 
the estimators considered in this study are presented. In 
Section IIIII the results of the simulations are presented 
and in Section IIVI the results are discussed and conclu- 
sions are drawn. 



II. ESTIMATORS OF MUTUAL INFORMATION 

In information theory, mutual information is defined 
as a measure of mutual dependence of two variables X 
and Y and has the form 

IiX,Y)= [ [ f^,y{x,y)\og,I^£f^2lLAxdy (1) 

JxJy fx{x)fY{y) 

where fx,Y{x, y) is the joint probability density function 
(pdf) of X and Y, and fx{x), fviy) are the marginal 
pdfs of X and Y, respectively. The units of information 
of I{X, Y) depend on the base of the logarithm, e.g. bits 
for the base of 2 and nats for the natural logarithm in 

©• 

Assuming a partition of the domain of X and Y the 
double integral becomes a sum over the cells of the two- 
dimensional partition: 

IiX,Y) = J2px,y {^,J)\og, ^^^^r^ (2) 

^ Px{i)PY[J) 

where px{i): Py(j)j a-nd Px.Y^hj) a-re the marginal and 
joint probability distributions over the elements of the 



partition. In the limit of fine partitioning the expression 
in converges to ([T]). This may partly justify the abuse 
of notation of mutual information for the continuous and 
the discretized variables. 

It is always I{X, Y) > 0, with equality holding for inde- 
pendent variables, and 1{X, Y) < H{X) < log^ n (Jensen 
inequality), where H{X) = -Yn=iPi^i)^'^&a^/pi^i) is 
the entropy of X. We do not discuss the mutual infor- 
mation in terms of entropies as the estimation of mutual 
information we study in this work boils down to the es- 
timation of the densities in |T]) or probabilities in ([2|) . 

For a time series {Xt}"^i, sampled at fixed times r^, 
the mutual information is defined as a function of the 
delay t assuming the two variables X — Xt and Y = 

Xt-r, i.e. IiT)^l{Xt,Xt-r). 

The true mutual information is generally not known as 
joint and marginal probability density functions are un- 
known. A different estimator /(r) of I(t) is determined 
from the way the theoretical densities in ([1]) or probabili- 
ties in ([2]) are estimated. We discuss below three estima- 
tors considered in this work and their dependency on a 
parameter inherent in the estimation of the densities or 
probabilities. 



A. Histogram-based estimators 

The naive histogram-based estimator regards a parti- 
tion of the range of values of each variable into b discrete 
bins of equal length, termed as equidistant partitioning 
(ED). The density at each bin and each two-dimensional 
cell, or rather the probability functions in are esti- 
mated by the corresponding relative frequency of occur- 
rence of samples in the bin or cell. Many different criteria 
have been developed for the selection of the number of 
bins b or equivalently the length of each bin, e.g. see 
[23l . [13, m, [IBl ■ Alternatively, the partition can be done 
into equiprobable bins, so that each bin has the same 
occupancy, termed as equiprobable partitioning (EP). In 
any case the partitioning is the same for both variables 
and the only free parameter is b. A number of criteria 
for selecting 6 for one-dimensional binning can been used 
and in 16] the Cochran condition (requiring at least 5 
samples in a bin) extended for two-dimensional cells is 
used to select b. 

An extension of the equidistant and equiprobable 
partitioning is the adaptive partitioning of the two- 
dimensional plane. Darbellay and Vajda built an al- 
gorithm to estimate the mutual information (AD) by 
calculating relative frequencies on appropriate parti- 
tions formed in a way that conditional independence is 
achieved on the cells [l3| . The advantage of the estima- 
tor of Darbellay and Vajda is that it is adaptive to the 
data and does not involve a parameter for the binning. It 
does however involve a parameter for the independence 
test that can affect the performance of the estimator. 
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B. fc- nearest neighbor estimator 

Kraskov et al. proposed an estimator of mutual in- 
formation that uses the distances of fc-nearest neighbors 
to estimate the joint and marginal densities (KNN). 
For each reference point from the bivariate sample, a dis- 
tance length is determined so that k neighbors are within 
this distance length. Then the number of points with dis- 
tance less than half of this length give the estimate of the 
joint density at this point and the respective neighbors in 
one-dimension give the estimate of the marginal density 
for each variable. The algorithm uses discs (or squares 
depending on the metric) of a size adapted locally and 
then uses the corresponding size in the marginal sub- 
spaces, so in some sense the estimator is data adaptive. 
However, it involves as a free parameter the number of 
neighbors fc; large fc regards a small b of the histogram- 
based estimator. However, the estimator does not use a 
fixed neighborhood size and therefore there is not a clear 
association of k and b. 

C. Kernel estimator 

The kernel density estimator constructs a smooth es- 
timate of the unknown density by centering kernel func- 
tions at the data samples; kernels are used to obtain the 
weighted distances [ll|, [20] (KE) . The kernels essentially 
weight the distances of each point in the sample to the 
reference point depending on the form of the kernel func- 
tion and according to a given bandwidth h, so that small 
h produces more detail in the density estimate. Thus h 
plays the role in the kernel estimator that b plays in the 
histogram-based estimator (e.g. a rectangular kernel as- 
signs a histogram) and actually h has an inverse relation 
to b. It's advantage over histogram-based estimators is 
that it is independent of the location of the bins. Among 
the different kernel functions, Gaussian kernels are most 
commonly used and we use them here as well. This es- 
timator involves actually two free parameters: the band- 
width hi for the estimation of the marginal densities and 
the bandwidth /i2 for the estimation of the joint density. 
Kernel estimators are considered to be the most appro- 
priate for density estimation of one-dimensional data, but 
this does not necessary imply that the kernel estimator 
of mutual information is also the most appropriate. 

III. SIMULATIONS AND RESULTS 
A. Simulation setup 

The evaluation of the estimators is assessed by Monte- 
Carlo simulations on the following chaotic systems: 
Henon and Ikeda map, and Mackey Glass differential sys- 
tem with delay A = 17, 30, 100 regarding increasing com- 
plexity with A (the sampling time is 17s). The factors 
considered are the time series length n, given in a power 



of 2 from 8 to 13, and the noise level, i.e. the standard de- 
viation of additive Gaussian noise is 20%, 40% and 80% 
of the standard deviation s of the data, /(r) is com- 
puted using all methods on 1000 realizations from the 
above systems up to a lag t for which /(r) converges to 
a non-negative constant value. 

For each method, the corresponding free parameter 
covers a wide range of values. For the ED and EP estima- 
tors we set the number of bins to be 6 = 2, 4, 8, 16, 32, 64. 
The same values are set for the parameter fc of the KNN 
estimator. 

For the KE estimator we take 15 different values of 
hi from 0.01 to 2 with increment in logarithmic scale 
and /i2 — hi or /12 — ^/2hi (in order to account for the 
increase of distance from one to two dimensions using the 
Euclidean norm). Note that the values of h regard the 
standardized data. We also consider some well-known 
criteria for the selection of bandwidth given in Table [D 
The three first criteria define bandwidth for both one and 
two-dimensional space. For the last three criteria we set 
the two-dimensional bandwidth /12 to be either equal to 
hi or multiplied with ^/2. 

TABLE I: Criteria for the selection of bandwidths for one 
(hi) and two (/12) dimensions and the reference of the source 
(last column). The parameters in the expressions are a = 
1.8-r (1) if n < 200 and a = 1.5 if n >= 200, where r (1) is the 
autocorrelation at lag \, R — 1/2^, IQ is the interquartile 
range of the data. 
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B. Evaluation criteria 

I(t) is generally not known for non-linear chaotic sys- 
tems. In order to evaluate the mutual information esti- 
mators, we examine their consistency and their depen- 
dence in the corresponding parameters for all systems 
and time series lengths. Regarding consistency, an esti- 
mator is consistent if it converges to the true value with 
n. In the lack of the true mutual information, we use as 
a reference value the /(t) computed on a realization of 
length n — 10*" for each system (for some systems and 
estimators we considered n = 10^ or n = 10^ for time 
consuming reasons) . Specifically, for time series from the 
flows of the Mackey Glass systems we focus also on the 
first minimum of the estimated mutual information and 
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examine the dependence of this on the free parameter of 
the estimator and the time series length. 



C. Equidistant estimator 

The ED estimator is heavily dependent on the selection 
of the number of bins b. The Monte Carlo simulations 
showed that /(t) increases with b even for very large time 
series for all systems. Moreover, for fixed b the estimator 
seems to be over n only for very small lags, whereas for 
larger r, /(r) decreases constantly with n, as shown in 
Fig. [1] for the Henon system. For smaller values of 6, 
/(r) is rather stable for all n but gives poor estimation 
close to the zero level for all lags. The simulation on the 
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For Mackey-Glass system, the /(r) is computed for a 
range of lags that include the lag of first minimum of 
mutual information in order to assess also the accuracy 
of the estimator in detecting this particular lag. We ob- 
served that although /(r) increases with b, the lag of the 
first minimum of mutual information does not vary with 
b. Moreover, the estimate of this lag is rather stable with 
n, as shown in Fig[3]for noise-free data and A = 17. 
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FIG. 3: Average of /(r) from the equidistant estimator for 
lags T — 1, . . . , 10 from 1000 realizations of the Mackey Glass 
system (sampling time 17) with A = 17 of length n — 1024 
in (a) and /(r) from one realization of length n = 10® in (b). 



FIG. 1: Average of /(r) from the equidistant histogram-based 
estimator for lags r = 1, . . . , 10 from 1000 simulations of the 
Henon map for different number of bins (as in the legend) and 
for lengths n = 1024 in (a) and n = 10^ in (b). 

different systems showed that the equidistant estimator 
depends on b and n, differently for small and large lags. 

With the addition of noise, variations in the estimated 
mutual information in terms of n and b become smaller. 
However, this is expected as I{t) decreases with the in- 
crease of noise level (see FigHJ . 
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FIG. 2: /(r) from the equidistant estimator for lags r = 
1, . . . , 10 from one realization of the Henon map of length 
n = 10'', for noise level 20% in (a) and 40% in (b). 

The inclusion of stronger noise component masks the 
deterministic structure and thus levels the estimate of 
mutual information towards zero. On the other hand, 
the benefit of noise in terms of stable estimation, is that 
I{t) gets less dependent on b, e.g. in Fig[2]the estimate 
converges when & > 16 for noise level at 20% and 6 > 8 
for noise level at 40%. 



D. Equiprobable estimator 

The EP estimator has the same form of dependence on 
b, n and r as the ED estimator. Moreover, the estimated 
I(t) values from the two estimators may vary for small 
n but converge with n, as shown for the noise-free data 
from the Ikeda system in Fig|4l 





FIG. 4: Average of /(r) from the equidistant estimator (black 
lines) and the equiprobable estimator (gray lines, cyan in the 
color online) for lags r = 1, . . . , 10 from 1000 realizations of 
the Ikeda map for different number of bins (as in the legend), 
for lengths n = 256 in (a) and n = 8192 in (b). 

Other works seem to agree with the result that estima- 
tors using a fixed partition heavily depend on the selec- 
tion of b [3i di [13 • Given that the chaotic systems have 
significant mutual information at least for very small lags 
and that the ED and EP estimators get close to zero for 
small number of bins, we conclude that a small 6 is a good 
choice for these estimators only for independent time se- 
ries. (We observed the same for linear stochastic systems 
in a different study.) To the contrary, for chaotic sys- 
tems a refined partition explores in more detail the fine 



5 



structure of the distribution (marginal and joint) and 
gives better estimate of the true mutual information. We 
would expect that in the limit of fine partition, where the 
expression of the discretized version of mutual informa- 
tion in ^ converges to the true mutual information in 
(P), /(r) would converge as well, but this would require 
an infinite amount of data. Thus a good choice of bin 
width should balance a large b with the limited size n of 
the available data in order to maintain a good estimation 
of the probabilities in in particular at regions with 
low data densities that may carry valuable information 
for the system dynamics. For a fixed b, the ED and EP 
estimators are rather stable with respect to n, i.e. they 
are consistent estimators of the mutual information as 
defined in ([2]) for the specific partition determined by b. 



The AD estimator of Darbellay and Vajda is appar- 
ently independent of a parameter for the partitioning. 
However, it has a direct dependence on n, which deter- 
mines the roughness of the partitioning in a somehow 
automatic way. In the abundance of data, the AD esti- 
mator reaches a very fine partition that satisfies the inde- 
pendence condition in each cell, so that the total number 
of cells is very large (analogously to a fixed-partition with 
a respectively large b). Thus the increase of n implies a 
finer partition and explains the increase of /(r) from the 
adaptive estimator with n, as shown for the Henon map 
in Fig[5^. Note that the dependence of AD on n is not 
comparable to that of the fixed-bin estimators because it 
involves a change of partitioning with n. In this sense, 
AD is not consistent with respect to n. However, in the 
presence of noise, the effect of n on the adaptive estima- 
tor decreases with the noise level (see FiglSJj). 
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FIG. 5: Average of /(r) from the adaptive histogram-based 
estimator for lags r = 1, . . . , 10 from 1000 realizations of the 
Henon map for different lengths n (as in the legend) , for noise- 
free data in (a) and noisy data at 20% noise level in (b). 

AD is considered to be one of the most precise and 
efficient algorithms for estimating the mutual informa- 
tion that converges fast to the true mutual information, 
basically for certain distribution and Gaussian processes 
where this is analytically given [13, EB]- However, in the 
case of nonlinear systems the estimator does not seem to 
converge with n, unless the fine partition is limited by 



the presence of noise. 

F. fc-nearest neighbor estimator 

The parameter of the number of nearest neighbors k 
in this estimator determines the roughness of approxima- 
tion of the density functions in ([l]), which corresponds 
to the roughness of the partitioning in ([2]). Thus for a 
fixed n, as /(r) from fixed-bin estimators increase with 
b (finer partition), the /(r) from KNN increases when k 
decreases, as shown in Fig|6^ and b. This dependence 
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FIG. 6: Average of /(r) from the fc-nearest neighbor estimator 
for lags T = 1, . . . , 10 from 1000 realizations of the Henon 
map. In (a) and (b) the results are displayed for different k 
(as in the legend) and for n — 256 and n = 8192, respectively. 
In (c) and (d) the results are displayed for different n (as in 
the legend) and for k — 2 and k = 32, respectively. 

persists also for very large n, whereas for a small time 
series a large k gives a poor estimation of the densities 
and consequently of /(r) (see Fig|B^, particularly when 
k — 64: and n = 256). 

We also observed that /(r) increase with n for a fixed 
k (see FiglB}; and d). Assuming a fixed parameter {k and 
6) the effect of n on KNN is larger than on the fixed-bin 
estimators and similar to the effect of n on the adaptive 
histogram-based estimator. 

In agreement with the histogram-based estimators, 
KNN decreases with the noise level (see Figl?]). However 
in the presence of noise, the dependence of the KNN on k 
is less than the dependence of the ED and EP on b (e.g. 
for the case of Henon map with n = 10^ and 20% noise 
level compare FiglTlD to Fig[^). This results is in agree- 
ment with Nicolaou et al. [17j that found independence 
of this estimator on k for EEG data. 

An upper limit for k is set by n whereas the lower limit 
k — I corresponds to the finest partition we could get 
for a histogram-based estimator. Thus the restriction to 
small k proposed by Kraskov et al. jl^] and used in other 



1.5 

E. Adaptive histogram-based estimator ~ i 



6 



(a) 



(b) 



(a) 



(t>) 







= 2 


0.6 




■■■k = 4 
-^k = 8 
-*-k= 16 


_0.4 




— k.32 
-•-k=64 


0.2 














— k = 2 


40 




■■■k = 4 






-^k = 8 


30 




^-k= 16 






— k = 32 






-•-k=64 


E20 




10 










»h,=0.01 


20 




0.021 






-^0.045 


15 




0.096 




-*-0.206 






■■■0.44 


=•10 




0.938 






-«-2 


5 


1 i 1 1 1 1 1 J 








»h,=0.01 




■■ 0.021 




-^0.045 




0.096 




-*-0.206 




■■■0.44 




0.938 




-«-2 


i ' t I I k I I 





FIG. 7: Average of /(r) from the fc-nearest neighbor estimator 
for lags r = 1, . . . , 10 from 1000 realizations of the Henon map 
with 20% noise of length n = 1024 in (a) and J(r) from one 
realization of length n = 10^ in (b). 



simulation studies @, [3, corresponds to fine partitions. 
Thus for chaotic systems that require fine partitioning, a 
fair comparison of the KNN for these k values to ED and 
EP would require a very large b. To this respect, this 
simulation comparison is restricted to comparable small 
b (up to 64) due to computational limitations. 



G. Kernel estimator 

Among different kernel functions used in the liter- 
ature for density estimation, and for mutual informa- 
tion estimation in particular, the common practice is to 
use the Gaussian kernel in conjunction with the "Gaus- 
sian" bandwidth of Silverman [ll| or multiplies of it 
[13, [1^, [13| ■ There are some other criteria for bandwidth 
selection that we have included in this study (see Ta- 
ble Moreover, we test the KE estimator of mutual 
information also for a range of bandwidths as for the 
other estimators. 

First we investigate the inter-dependence in the se- 
lection of hi and /i2 across the selected range of band- 
widths. As shown in Figl8]for the Henon map, selecting 
/i2 = V2hi instead of /i2 — hi simply decreases the value 
of /(t), without alerting the form of the dependence of 
/(r) on T and this occurs independently of n. So, the 
roughness of the partitioning is determined by hi, i.e. 
smaller hi implies finer partitions or small neighborhoods 
with respect to KNN. We note however that /(r) from 
KEreaches very small values for as large hi as 2 and very 
large values for as small hi as 0.01. Note that such ex- 
tremely large values of /(r) do not occur by any other 
estimator. 

Regarding the 9 criteria selecting hi (and at cases /12, 
see Table IJ), the estimated bandwidths vary but within a 
small range (for n = 512 in Fig. [3^ they are bounded in 
[0.1,0.3] except C3 that always gives larger bandwidths 
where in this case is hi ~ 0.64). All criteria depend on 
n in a similar way and estimate smaller bandwidths as n 
increases giving larger /(r) (see Fig. ^ and b). Thus a 
kernel estimator using a specific criterion turns out not 
to be consistent. 

When noise is added to the time series /(r) decreases 
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FIG. 8: Average of /(r) from the kernel estimator for lags 
r = 1, . . . , 10 from 1000 reahzations of the Henon map for a 
range of different values of bandwidth hi (as in the legend) . 
(a) n = 512 and h2 = h\, (b) n — 512 and /i2 ~ V2hi, (c) 
n = 8192 and /12 = /ii, (d) n = 8192 and /i2 = V2hi 
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FIG. 9: Average of /(r) from the kernel estimator for lags 
r = 1, . . . , 10 from 1000 realizations of the Henon map from 
the nine bandwidth selection criteria for noise-free data and 
for n — 512 in (a) and n — 8192 in (b) and for 20% noise level 
and n — 512 in (c) and n — 8192 in (d). 



and differences with respect to the partitioning parame- 
ters are smaller, as observed in the other estimators. This 
holds also when a specific bandwidth selection criterion 
is used, and in particular the estimated /(r) is rather 
stable to the change of n (see Fig. [9]: and d). 

In the estimation of mutual information with kernels, 
the range of bandwidths is usually not searched and a 
bandwidth is selected according to a criterion such as 
the "Gaussian" bandwidth 14]. However, our simula- 
tions have shown that KE estimator is strongly depen- 
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dent on the bandwidth that defines the partition differ- 
ently according to the sample size, and these findings are 
in agreement with other works [30, [Sll ■ 



H. Evaluation of estimators and their parameters 

The results on the different estimators have shown a 
varying sensitivity of the estimator to its free parameter, 
where histogram-based estimators turned out to be the 
most sensitive. However, there seems to be a loose corre- 
spondence among the different free parameters b, k and 
hi, /i2 depending also on the time series length. Thus the 
differences in the performance of the estimators can be 
explained to some degree by the coarseness of the par- 
tition as determined by its free parameter. The choice 
of b for the histogram-based estimators determines the 
bin size of the partition. The analogue of the bin size 
for the fc-nearest neighbors estimator is the size of neigh- 
borhoods and for the kernel estimator is the size of the 
efScient support for the kernel given by the bandwidth. 

In order to check the correspondence of the estimator 
specific parameters, first we set a specific bin length r for 
the partition given by b. For KNN the parameter k is 
set to the average number of neighbors within each disc 
of diameter r. For KE we set h2 — r/2. In this way, we 
attempt to match the partition for each estimator. How- 
ever, this selection scheme for the parameters does not re- 
sult in similar estimated values of /(t). For example, for 
a time series of the Henon map with length n = 512 when 
6 = 16 the estimates of /(r) seem to agree at some degree 
where as when & = 32 they vary significantly, as shown 
in Figlini When the standard criteria for the selection of 
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estimates of J(t) vary even more (see FigfTOb). For a bit 
larger length a,s n — 1024 we observed that if we choose 
a bit larger value of 5 as 6 = 20, we get similar estimated 
values of /(r)(see FigHUJi). 

[to revise this according to the results on 
differen n and for the other systems.] We con- 
cluded that the optimization of parameters is very 
crucial even more than the choice of the estimator, as we 
can see that no estimator exhibits consistency especially 
in the case of noise-free data. 

It is also important to compare the computational cost 
of the estimators. The kernel estimator has the high- 
est computational cost and the computation time for the 
histogram-based estimators is also prohibitive for very 
large values of b. The adaptive estimator has the advan- 
tage of being fast and parameter-free but tends to give 
larger /(r) (compared to the other estimators) for small 
noise-free time series. 

Most of the results of the different estimators are illus- 
trated for the Henon map in order to facilitate compar- 
isons, but qualitatively similar results are obtained from 
the simulations on the Ikeda map and the Mackey- Glass 
system. 



IV. DISCUSSION 

Mutual information estimators are not consistent for 
non-linear noise-free systems and the choice of parame- 
ters is crucial for all estimators. We cannot find an opti- 
mal parameter choice as there is no consistency. However 
with addition of noise in the systems, the choice of the 
parameters is not that crucial as there is convergence of 
the estimated /(r) values and all estimators seem to be 
consistent especially for larger time series lengths, k- 
nearest neighbor estimates of /(r) varies less with the 
free parameter (fc) compared to the other estimators. 

As a general conclusion we can say that all estimators 
depend on the parameter choice and therefore is crucial 
to optimize their parameter. Also consistency of estima- 
tors for linear systems is not indicative of the estimators 
behavior for nonlinear systems. Although consistency of 
estimators is claimed in many past works might be due 
to inclusion of only linear systems in their evaluation or 
presence of noise (e.g. real data). 

However, this shows that the fc-nearest neighbor esti- 
mator is computationally more effective when fine parti- 
tions are sought, due to the use of effective data struc- 
tures in the search for neighbors. 



FIG. 10: /(r) from one realization of the Henon map of from 
all estimators, for parameters as in the legends for n = 512 in 
(a), (b) and (c) and n = 1024 in (d). 

the parameters are considered, i.e. b = -^/n/S, fc = 3, and 
bandwidths hi , /12 given by the Silverman's criterion, the 
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